Version: 2.0

Reliability

For two decades, I’ve seen engineering teams chase “reliability” like a holy grail. We’ve poured resources into testing, monitoring, and complex architectures, all in the pursuit of a stable system. And while those things matter, I’ve consistently observed that true reliability isn't solved through technology alone. It’s a leadership problem, stemming from how we think about building software and manage the people who build it.

Think of the recent outages impacting major services – often, the root cause isn’t a single line of bad code, but a cascade of failures stemming from systemic issues and leadership blindspots. This isn't about blaming anyone. It's about recognizing that chasing ever-increasing numbers of nines (99.999% uptime, anyone?) without addressing the underlying leadership weaknesses is a costly and ultimately futile exercise. Let’s unpack that.

Beyond the Technical Checklist: What Really Drives Unreliability

We often define reliability in purely technical terms: minimizing bugs, robust error handling, comprehensive monitoring, and resilient infrastructure. These are necessary conditions, absolutely. But they are not sufficient. Here’s where the cracks start to show:

The Pressure Cooker of Short Cycles: The drive for “faster delivery” – a constant refrain in modern software development – frequently sacrifices quality. Delivering software at progressively shorter intervals always places the burden of ensuring quality on developers. And while conscientious developers will shoulder that responsibility, it's unsustainable and sets a dangerous precedent. It’s leadership’s job to protect the team from that pressure, not exacerbate it.
Ignoring the “Why” Behind Failures: Research consistently shows that software project failures are rarely solely technical. Lehtinen et al.'s work, for example, revealed that a significant portion of failures stem from inadequate requirements gathering, poor communication, and a lack of stakeholder alignment. These aren’t technical problems to be “fixed” with more code; they are leadership failures in setting direction and ensuring clarity.
The Myth of the "Hero" Developer: Blaming individuals for failures – even those who “own” a particular service – is a quick path to a toxic culture. Imagine a scenario where a critical service goes down, and the immediate response is to ask "Who broke it?". This creates fear and discourages transparency. While accountability is important, focusing solely on individual mistakes ignores systemic issues. A team constantly firefighting is a sign of failing leadership, not failing engineers. As the research suggests, more often than not, failures are rooted in behavioral patterns impacting collaboration and accountability, not technical skill.
Treating Reliability as an Afterthought: Reliability isn’t something you “add on” at the end of a sprint. It needs to be baked into the development process from the beginning. That means prioritizing technical debt reduction, investing in automated testing, and allocating time for refactoring – even when deadlines loom. These are strategic decisions that require strong leadership to champion.

A Leadership Framework for Building Reliable Systems

So, what can engineering leaders do to foster a culture of reliability? Here's a framework based on my experience:

Prioritize "Defensibility" Over "Velocity": Think long-term. Is your architecture built to withstand change? Will your market position be defensible in the face of competition? Reliability isn’t just about keeping the lights on today; it’s about building a system that can adapt and thrive in the future.
Embrace "Never Fail CD" as a Mindset: Continuous Delivery is powerful, but not at any cost. While the phrase implies perfection, it's about building safeguards to mitigate risk. Implement canary deployments, feature flags, and automated rollbacks. Treat every deployment as a potential failure point and plan accordingly. This isn’t about slowing down; it’s about being prepared.
Foster Psychological Safety: Encourage open communication. Make it safe to admit mistakes. Conduct blameless postmortems focused on learning and improvement, not assigning blame. Implement regular retrospectives to identify systemic issues. A team afraid to speak up will hide critical issues, leading to systemic failures.
Invest in "System Thinking": Encourage engineers to understand the entire system, not just their component. Promote cross-functional collaboration. Break down silos. The more everyone understands how their work impacts others, the more reliable the system will be. (A simple system diagram showing dependencies can be incredibly helpful here).
Lead by Example: Prioritize quality in your own work. Demonstrate a commitment to long-term sustainability. Advocate for the resources and time needed to build reliable systems.

Beyond Uptime: The Real Measure of Reliability

Ultimately, reliability isn’t just about achieving a certain uptime percentage. It’s about building trust with your users. It’s about delivering a consistently positive experience. It’s about creating a system that you can depend on, even when things go wrong.

And that requires more than just technical skill. It requires strong leadership, a commitment to long-term sustainability, and a culture of trust and accountability.

Beyond the Technical Checklist: What Really Drives Unreliability​

A Leadership Framework for Building Reliable Systems​

Beyond Uptime: The Real Measure of Reliability​

Beyond the Technical Checklist: What Really Drives Unreliability

A Leadership Framework for Building Reliable Systems

Beyond Uptime: The Real Measure of Reliability